In this report of our data quality analysis and explratory analysis, Part A demonstrates results on price distribution, room type distribution and pet policy, while Part B shows the results from text analytics. Each part has its own data analysis (section 3) and exploratory data analysis (section 4) as the files used are different and analytic tasks, discussed and decided among all the team members, were executed by Charissa Ding and Chin-Wen Chang, respectively.

Part A: Price Distribution, Room Type Distribution, and Pet Policy

3. Data Quality

I. Missing Data

library(readr)
library(extracat)
library(tidyverse)
library(vcd)
library(RColorBrewer)
library(stringr)
library(vcd)
library(gridExtra)
library(ggthemes)
library(viridis)

listings_short <- read_csv("cleaned_data.csv",col_types = cols())

visna(listings_short)

d<-filter(listings_short,is.na(last_review))
summary(d$number_of_reviews)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000000 0.0000000 0.0000000 0.0008272 0.0000000 1.0000000

Missing data in the dataset has the following main patterns:

Although in the plot above we see blue patterns frequently, the bars on the right indicate that there are only two major patterns – obsercations either have no missing data, or they are missing score information + review information (first, last review, reviews per month) altogether.

We suspect that some rows are missing score and review information because they have low or zero reviews in place. To validate this hypothesis, we filtered the dataset to the ones that are missing ‘last review’, and found that almost all of these rows have 0 number of reviews. This explains the missing data pattern – these places have very few or no reviews, and therefore are also missing score information, which is most likely to be filled out when users choose to leave a comment.

4. Main Analysis (Exploratory Data Analysis)

I. Overview

In part A of the Main Analysis section, we start by looking at some patterns concerning the price and popularity of the listings in New York City.

We first look at a quick summary table that reveals information such as number of listings and number of reviews in each borough, as well as average price per night.

Next, we go into more details and find the top 50 most popular neighbourhoods by number of reviews, and the top 50 most expensive neighbourhoods by average price per night.

listing_by_borough <- listings_short %>% group_by(neighbourhood_group_cleansed) %>% summarize(count = n(),avg_price = mean(price),ttl_num_rev = sum(number_of_reviews))

listing_by_borough
## # A tibble: 5 x 4
##   neighbourhood_group_cleansed count avg_price ttl_num_rev
##   <chr>                        <int>     <dbl>       <int>
## 1 Bronx                          879      85.8       16154
## 2 Brooklyn                     20445     117        359156
## 3 Manhattan                    22153     183        417541
## 4 Queens                        5079      95.9      103206
## 5 Staten Island                  296     127          6176
most_popular_neighbourhood <- listings_short %>% group_by(neighbourhood_group_cleansed,neighbourhood_cleansed) %>% summarize(count = n(),avg_price = mean(price),ttl_num_rev = sum(number_of_reviews)) %>% arrange(desc(ttl_num_rev))

most_popular_neighbourhood<-head(most_popular_neighbourhood,50)

most_expensive_neighbourhood <- listings_short %>% group_by(neighbourhood_group_cleansed,neighbourhood_cleansed) %>% summarize(count = n(),avg_price = mean(price),ttl_num_rev = sum(number_of_reviews)) %>% arrange(desc(avg_price))

most_expensive_neighbourhood<-head(most_expensive_neighbourhood,50)

ggplot(most_popular_neighbourhood, aes(reorder(neighbourhood_cleansed,ttl_num_rev), ttl_num_rev,fill=neighbourhood_group_cleansed)) + geom_col() + coord_flip()+
   ggtitle("Top 50 Most Reviewed Neighbourhoods") +
   ylab("Number of Reviews") +
   xlab("Neighbourhoods") +
   labs(fill = 'Boroughs') +
   scale_fill_viridis(discrete=TRUE)

ggsave('images/most_reviewed.png')

ggplot(most_expensive_neighbourhood, aes(avg_price,reorder(neighbourhood_cleansed,avg_price))) + geom_point(aes(colour=neighbourhood_group_cleansed)) + 
  ggtitle("Top 50 Most Expensive Neighbourhoods") +
  xlab("Average Price") +
  ylab("Neighbourhoods") +
  labs(colour = 'Boroughs') +
  scale_color_viridis(discrete = TRUE)

  1. From the summary table we see that Manhattan (22153) and Brooklyn (20445) have the most listings. On average, Manhattan is the most expensive ($183 per night), followed by Staten Island ($127 per night), Queens ($96 per night), and Bronx ($86 per night).

  2. Top five most reviewed neighbourhoods are Bedford-Stuyvesant (Brooklyn), Williamsburg (Brooklyn), Harlem (Manhattan), Hell’s Kitchen (Manhattan), and East Village (Manhattan).

  3. Top five most expensive neighbourhoods are Fort Wadsworth (Staten Island), Westerleigh (Staten Island), Lighthouse Hill (Staten Island), Woodrow (Staten Island), and Tribeca (Manhattan). Although Staten Island doesn’t have many lisitngs, top four most expensive neighbourhood by average price are all in Staten Island.

II. Price Distribution

We then want to put more focus on the prices and draw up two sets of histrogram graphs to learn about general price trend. But before doing that, we first want to look at outliers and remove some of them to reduce skewness.

Outliers

boxplot(listings_short$price,main='Price Boxplot', ylab = 'Price per Night')

hist(listings_short$price, main = 'Price Histrogram',xlab='Price per Night')

summary(listings_short$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0    65.0   100.0   144.3   170.0 10000.0

Here we looked at the price per night range of all the listings, trying to identify outliers. We found that the vast majority of listings have a prive lower than $2000. In order to draw informative histrograms, we only kept observations with a price lower than $750.

Price Histrograms

listing_by_borough <- listings_short %>% group_by(neighbourhood_group_cleansed) %>% summarize(count = n(),avg_price = mean(price),avg_num_rev = mean(number_of_reviews))

listing_leq_750 <- listings_short %>% filter(price<=750)

ggplot(listing_leq_750, aes(price)) +
  geom_histogram(color = 'black', fill = 'gray') +
  ggtitle("Listing Price Distribution") + 
  xlab("Price per Night") +
  ylab("Number of Listings") 

ggplot(listing_leq_750, aes(price)) +
  geom_histogram(color='black',size=0.2,fill='gray') +
  facet_wrap(~neighbourhood_group_cleansed)+
  ggtitle("Listing Price Distribution by Borough") +
  xlab("Price per Night") +
  ylab("Number of Listings")

ggsave('images/price_by_borough.png')

Here we want to explore the distribution of listing prices. We first plotted an overall histrogram and learned that the listing prices are still highly right skewed even after cutting out outliers. Most listings have prices under $200 per night, and the most common prices are between $50 to $100. Considering the average prices of hotels in New York City is somewhere around $200 per night, Airbnb maybe a more economical choice.

We then faceted on boroughs, and see that price distribution in Brooklyn is even more highly skewed, whereas Manhattan’s price distribution is a little more spread out.

More specificly, Brooklyn tends to have a very high number of listings clustered in the price range of $50 to $100. Overall price distribution in Brooklyn is highly concentrated on the cheaper side– over 90% of listings are below $200 per night.

Manhattan on the other hand has a more even distribution. The most frequent prices are still between $50 - $100 but with a much lower percentage. Average prices in Manhattan are high due to longer and fatter tails in the distribution. This tells us that Airbnb in Brooklyn in general are cheaper than the overall distribution in the last graph, while Airbnb in Manhattan is a little bit more expensive.

Consider our findings from the introduction section - the fact that the top two most popular neighbourhoods are both located in Brooklyn maybe partially due to the lower price trends in this Borough.

III. Room Types

We noticed that there are three types of room available across Airbnb postings - private room, shared room, or entire home or apartment. We are interested to see if the distribution of these room types are correlated with boroughs and we decide to use mosaic plot to explore that relationship.

#colorblind-friendly palette
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

p1<-mosaic(main = "Room Type by Boroughs",room_type ~ neighbourhood_group_cleansed, listings_short,gp = gpar(fill =cbbPalette[4:6], col = "white"),labeling = labeling_border(varnames=c(FALSE,FALSE),rot_labels = c(10,0,0,10)))

p1
##                              room_type Entire home/apt Private room Shared room
## neighbourhood_group_cleansed                                                   
## Bronx                                              259          583          37
## Brooklyn                                          9054        10969         422
## Manhattan                                        12862         8725         566
## Queens                                            1786         3086         207
## Staten Island                                      136          156           4
png('images/room_type.png')
mosaic(main = "Room Type by Boroughs",room_type ~ neighbourhood_group_cleansed, listings_short,gp = gpar(fill =cbbPalette[4:6], col = "white"),labeling = labeling_border(varnames=c(FALSE,FALSE),rot_labels = c(10,0,0,10))) 
dev.off()
## quartz_off_screen 
##                 2

From this plot we see that room types are somewhat correlated with boroughs. More specifically, Manhattan hosts are more likely to list entire home/apt, while Queens and Brooklyn hosts are more likely to list private room within a house or apartment. Since Bronx and Staten Island have very small number of listings, their correlation with room types is nearly neglegible in the bigger picture. However, a further study on room types focused on these two boroughs may bring insteresting insights. Shared rooms are very rare across all boroughs.

Room Types + Price

Since we previously looked at price distribution, here we combine price distribution with room types for further insights.

ggplot(listing_leq_750, aes(price, fill = room_type)) +
  geom_histogram(color='black',size=0.2) +
  facet_wrap(~neighbourhood_group_cleansed)+
  ggtitle("Listing Price by Borough and Room Type") +
  xlab("Price per Night")+
  ylab("Number of Listings")+
  labs(fill = 'Room Type') +
  scale_fill_viridis(discrete = TRUE)

When combine room types and listing prices, we see that as price goes up, room type “Entire home/apt” starts to take a higher percentage, which is consistent with common sense since entire home usually means more space and pricacy, hence higher price. This trend stays true across all boroughs.

IV. Pet Policy

Since some guest may be traveling with their pets, the last section in Part A looks at the percentage of listings that allow pets to stay in their palces.

We first use a mosaic plot to see the pet policy is correlated with boroughs. We then use a grouped bar charts to looked the shares of listings that allow for specific types of pets.

listings_pets <- read_csv("listings_pets.csv", col_types = cols())

ls_pets <- listings_pets %>%
  mutate(pets = ifelse(str_detect(amenities,'pets allowed') | str_detect(amenities,'cat') | str_detect(amenities,'dog'),1,0))

ls_pets <- ls_pets %>%
  mutate(dog = ifelse(str_detect(amenities,'dog'),1,0))

ls_pets <- ls_pets %>%
  mutate(cat = ifelse(str_detect(amenities,'cat'),1,0))

ls_pets <- ls_pets %>%
  mutate(Unspecified_Other = ifelse(cat==0 && dog==0,1,0))

ls_pets <- ls_pets %>%
  mutate(pet_category = case_when(
    cat==1 & dog ==1 ~ 'Cat and Dog',                    
    cat ==1 & dog ==0 ~ 'Cat Only',
    cat ==0 & dog==1 ~ 'Dog Only',
    cat==0 & dog==0 ~ 'Unspecified'
  ))

pets_only<-filter(ls_pets,pets==1)

mosaic(main ="Are pets allowed?",pets ~ neighbourhood_group_cleansed, ls_pets, gp = gpar(fill = cbbPalette[2:3], col = "white"), labeling = labeling_border(varnames=c(FALSE,FALSE),rot_labels = c(0,45,30,30)))

ggplot(pets_only, aes(x=neighbourhood_group_cleansed, fill=pet_category)) +
  geom_bar(position='dodge') +
  ggtitle('Pet Policy by Borough and Pet Type') +
  xlab("Neighbourhoods") +
  ylab("Number of Listings") +
  labs(fill='Pet Category') +
  scale_fill_colorblind()

From the mosaic plot we see that pet policy may only be slightly correlated with boroughts. Overall, most listings do not explicitly state that they welcome pets, only a small percentage of listings explicitly express that they are willing to accept pets. This trend is consistent across all boroughs. Brooklyn is slightly more pet friendly, compared to the others.

For the listings that do allow pets, we break them down into whether they specify what type of pets are allowed, by looking at the group bar char graph. Among the listings that allow pets, cats in general take a higher percentage and therefore are porbably more welcomed than dogs, especially in Brooklyn. Also a large number of listings only indicate that they are pet freindly, but do not restrict to specific pet types.

Part B - Text Analytics

In this part, we moved onto text analytics to reveal how listing owners portray their places and visitors describe their experiences and express their feelings after staying at a specific venue of Airbnb. Listing summary and amenities description provide a good clue to for the former question and reviews comments shed insights on the latter question. The use of adjective words are indicative of positive and negative sentiment of visitors’ experiences. Moreover, we also investigated users’ behviours on sharing their experiecens as well as change of user behaviour over time on Airbnb based on the length of reviews in the combination of different NYC boroughs, room-types or rating category.

Text analytics on Airbnb NYC dataset mainly focused on the three areas: listings summary, amenities description and review comments. The frist section is the description of preprocessing, following by the analysis of data quality and the interesting questions to address in each area. Following are the questions we asked to understand user behaviours and experiences on Airbnb:

Preprocessing

Most of the preprocessing in text analytics was done in python. The python codes can be found in the jupyter notebook TextAnalytics_preprocessing.ipynb (https://github.com/chding12/EDAV_Final_Project/blob/master/TextAnalytics_preprocessing_NLP.ipynb). Originally we have listingsAll.csv and reviewsAll.csv containing airbnb dataset in the past 10 years. After the inital cleaning process (refer to https://github.com/chding12/EDAV_Final_Project/blob/master/clean_data.Rmd) based on listingsAll.csv, we created cleaned_data.csv containing the required columns.

In TextAnalytics_preprocessing.ipynb, 4 steps are executed to complete the preprocessing for text analytics. The dataset was first loaded into pandas dataframe (https://pandas.pydata.org/) and then processed.

  1. Add rating ategory to the cleaned file We created a new column by assigning each listing with a category based their review scores. The distribution of review scores is shown in 4.1.1 ratings distribution of the Overview section using dataset from listingsAll.csv and is used as references to create 3 categories of ratings (A, B and C). The new column “rating_categ” was combined to the cleaned data to create CleanedPlus.csv.

  2. Adjective words in summary for visualization In order to plot a wordcloud of adjective words in summary, we created a table of words and its frequencies for each rating category. We first tokenized summary text in each rating category into individual words and did pos tagging to identify the part-of-speech tag of each word. As pos-tagging is case-sensitive, we only turned each words into lowercases later. The finalized data with three columns (word, freq, category) was written to adj_summary.csv, consisting of all the three rating categories. This dataset is later used in the 4.2.1 and 4.2.2 of the Summary section.

  3. Adjective words in review for visualization We worked on review comments from reviewsAll.csv. (1) The review comments was tokenized into individual words. (2) English stopwords, such as “a”, “the”, etc…, were removed from the words. (3) Post tagging. (4) Find adjective words. (5) Find nouns. (6) Count the number of adjective, noun and total words. After the preprocessing, we kept only required columns for visualization to reviewsPlus.csv. This dataset is used in 3.2 for data quality analysis and in 4.4.2 to 4.4.4 of the Review section. Moreover, we counted adjective words in each rating category and stored the data of three columns (word, freq, category) to adj_reviews.csv for wordcloud visualization in 4.4.1 of the Review section.

  4. Amenities word frequency Key words of amenities description were extracted and counted. The words and frequencies of amenities were written to amenities.csv for 4.3.1 in the Amenities section.

3. Analysis of Data Quality

listingsAll <- read.csv('listingsAll.csv', stringsAsFactors=FALSE, sep=",", na.strings=c("","NA"))
cleanedPlus <- read.csv("CleanedPlus.csv", stringsAsFactors=FALSE, sep=",")
reviewsPlus <- read.csv('reviewsPlus.csv', stringsAsFactors=FALSE, sep=",")

3.1 Analysis of data quality - review data

library(extracat)
visna(reviewsPlus, sort='c')

na_ind <- which(is.na(reviewsPlus$comments) == TRUE)
length(na_ind)
## [1] 870
nrow(reviewsPlus)
## [1] 896208
length(na_ind)/nrow(reviewsPlus)
## [1] 0.0009707568

Here we investigated the missing data pattern in the review dataset. As you can see from the above figure, only one missing data pattern is observed. In the column of comments, some data is missing. It is hard to see the very thin red color representing the number of missing rows (870) in the plot, as the percentage of missing rows are very few (less than 0.001%) given the total number of rows is 896208. As you might observe that while the column “comments” are null, the newly created columns based on “comments” from the text preprocessing (number of words, adj and noun) are not null. Even they are supposed to be null due to lack of comments, other values instead of NA during preprocessing were used in the preprocessing. Later to analyze review comments in section IV, all the rows where comments are null were removed.

3.2 Analysis of data quality - listing data of summary and review scores only

library(dplyr)
ls <- listingsAll %>%
  select(summary, review_scores_rating)

visna(ls, sort='c')

As you can see from the above plot, three missing pattern of summary and review scores in the listingsAll dataset are present. Most listings have both reviews scores and summary. Records having missing value in review scores only are around 1/3 of the complete recoreds, while a very small portion of records miss only summary or both reviews scores and summary. As we needed both summary and review scores for later text analysis, rows where either of these two columns are null were removed before ploting the reviews scores distribution in 4.1.1.

4. Main Analysis (Exploratory Data Analysis)

I.Overview

4.1.1 ratings distribution

library(dplyr)
mydata1 <- ls %>%
  dplyr::filter(!is.na(review_scores_rating))

hist(mydata1$review_scores_rating)

The purpose of this plot is to observe the distribution of review scores of each listing to come up with three rating categories. Category A indicates excellent listings with reviews scores from 90 to 100. Category B represent the good listings with reviews scores ranging between 75 to 90. Category C is a group of not-so-good listings where reviews scores are less than 75. It is interesting to notice that the majority of listings have high review scores.

II. Listing summary

4.2.1 Which adjective words appear most frequently in listing summary?

#install.packages("wordcloud") 
#install.packages("RColorBrewer") 
library(wordcloud)
library(RColorBrewer)

adj_s_all <- read.csv('adj_summary.csv', stringsAsFactors=FALSE,sep = "#")
adj_s_all$freq <- as.numeric(adj_s_all$freq)
#all
words_to_remove <- c('nan', 'll')

s1 <- adj_s_all %>%
  select(c("word", "freq")) %>%
  group_by(word) %>%
  summarize(freq=sum(freq)) %>%
  filter(!word %in% words_to_remove)

wordcloud(words = s1$word, freq = s1$freq,
                      min.freq = 1,
          max.words=200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Paired"))

What aspects listing owners think are the most important and attractive to visitors can be shown by how they describe their spaces. As we can see from the wordcloud representing most common adjective words in the listing summary, “private” is the most frequently used words to describe a listing. It corresponds to the nature of Airbnb business model as it is based on the concept of sharing hosts’ personal housing with visitors. The other common words can be divided into two groups. First group, such “beautiful”, “great”, “comfortable”, expresses abstract ideas of the venue to stimulate reader’s imagination, while the second group, such as “new”, “quiet”, “spacious”, “large”, “clean”, gives readers more hints on the physical condition of the place to bring vivid pictures into user’s imagination.

4.2.2 Are words of summary in each rating category the same?

library(ggplot2)

adj_s_all %>%
  group_by(category) %>%
  top_n(n=10, wt=freq) %>%
  ungroup() %>%
  ggplot(aes(reorder(word, freq), freq)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~category, scales = 'free') +
  labs(y = "words frequency",
       X="most frequent words") +
  coord_flip()

When listing owners describe their place, they would tend not only to portray their spaces to impress visitors but also be authentic to the real condition to certain extent. It turns out from our data exploration that these subtlties can be observed from the use of adjective words in their listing summary. We studied the summary from each rating groups and found interesting different usage of adjective words.

Top 10 common adjective words are very similar among the three rating categories like “great”, “private”, “beautiful”. Nevertheless, each group has one or two words different from the other group. Group A has the word “new”, which does not appear in group B or C. Group A and Group B both has the word “comfortable”, which is not present in group C. Both group B and C has the word “good”, which is not in the top list in group A. In group C, without “new” and “comfortable”, it has “available” instead. In conclusion, group A has “new” and “comfortable”, while group B has “comfortable” and “good”, and group C has “good” and “available”. The meaning of the three combinations already reveal the real condition of the listings to certain extent, which is confirmed by the rating groups of review scores.

III. Amenities

4.3.1 What are the most frequent amenities words?

amenities <- read.csv("amenities.csv", stringsAsFactors=FALSE,sep = "#")

amenities <- amenities[!startsWith(amenities$word, "translation missing"),]

wordcloud(words = amenities$word, freq = amenities$freq, min.freq = 1,
          max.words = 200, random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))

Apart from listing summary, amenities provide a clue how owners descrbe their spaces to make it more attractive to potential to visitors. Word frequency of amenities in listings are shown in the above wordcloud figure. When it comes to amenities, listing owners mostly focus on three areas basics, technology and safety.

  • Basics: “essentials”, “hangers”, “shampoo”, “hair dryer”
  • General Facility: “kitchen”, “heating”, “air conditioning”
  • Technology: ’Wifi“,”laptop friendly workspace“,”internet"
  • Safety: “smoke detector”, “carbon monoxide detector”, “fire extinguisher”.

Apparently, “kitchen”, “heating” and “air conditioning”, which provides comfort to a venue, are very often mentioned as part of the amenities provided by the listing. Safety is considered very important by the owners as those three words are comparably large as well. In terms of technology, “Wifi”, providing our connection to the world, is as essential as elements providing pyhsical comfort such as “kitchen”" or “heating”. Commonly mentioned items in basics include “hangers”, “hair dryer” and “shampoo”. These things are comparably easier to provide and it is nice to have them when visitors temporily stay at a place.

IV. Reviews

4.4.1 Which words appear in high-rating or low-rating reviews?

library(tidytext)
library(reshape2)
library(wordcloud)

adj_r <- read.csv("adj_review.csv", stringsAsFactors=FALSE,sep = "#")
adj_r_A <- adj_r %>%
  filter(category == 'A') %>%
  top_n(n=200, wt=freq)

adj_r_C <- adj_r %>%
  filter(category == 'C') %>%
  top_n(n=200, wt=freq)

# The palette with black:
cbbPalette_s <- c("#0072B2", "#D55E00", "#CC79A7")

# words in high-rating reviews
df_A <- adj_r_A %>%
  inner_join(get_sentiments("bing")) 

wordcloud::wordcloud(df_A$word, df_A$freq,
          ordered.colors=TRUE, max.words = 200,
          colors=cbbPalette_s[factor(df_A$sentiment)])

# words in low-rating reviews
df_C <- adj_r_C %>%
  inner_join(get_sentiments("bing"))

wordcloud(df_C$word, df_C$freq,
          ordered.colors=TRUE, max.words = 200,
          colors=cbbPalette_s[factor(df_C$sentiment)])

Visitors’ experiences after staying in a listing can be revealed by sentiment analysis on the words in their reviews. We first did sentiment analysis on the adjective words of the reivews to identify words with positive or negative sentiments. As it is shown in the above two wordclouds, mostly positive words appeared in the reviews of listings with good ratings (Group A). As we moved to group C, although we could still see a few positive words in the reviews, more negative words appeared. Moreover, some words, such as “wonderful”, “super”, and “fantastic”, to express exceptionally good feelings or opinions became less obvious or disappeared in the group C wordcloud. The choice of color is vision deficiency friendly.

4.4.2 Is the length of reviews of each room-type or borough different?

rp <- reviewsPlus[complete.cases(reviewsPlus),]

col_to_select <- c("id", "room_type", "neighbourhood_group_cleansed", 'rating_categ')
sc <- cleanedPlus %>%
  select(col_to_select)

gl <- rp %>%
  group_by(listing_id) %>%
  summarize(
    mean_adj=as.integer(mean(number_adj)),
    total_adj=sum(number_adj),
    mean_noun=as.integer(mean(number_noun)),
    total_noun=sum(number_noun),
    mean_words=as.integer(mean(number_words)),
    total_words=sum(number_words),
    review_counts=n(),
    perc_adj=total_adj/total_words,
    perc_noun=total_noun/total_words,
    perc_other=1-(perc_adj+perc_noun)
  ) %>%
  inner_join(sc, by = c("listing_id" = "id"))

Here we created columns required for analysis on reviews. In the following analysis on the length of reviews, we used the average words of reviews to represent this feature.

#unloadNamespace("viridis")   
#update.packages("viridis")
library(viridis)
library(dplyr)
library(ggplot2)
r1 <- gl %>%
  group_by(neighbourhood_group_cleansed, room_type) %>%
  dplyr::summarise(avg_words = mean(mean_words))

# Use position=position_dodge()
p <- ggplot(data=r1, aes(x=neighbourhood_group_cleansed, 
                    y=avg_words, fill=room_type)) +
  geom_bar(stat="identity", position=position_dodge()) +
  scale_fill_viridis(discrete=TRUE, option="magma") 

p

# check the number of reviews of shared room in Staten Island
ss <- gl %>%
  filter(room_type == "Shared room") %>%
  filter(neighbourhood_group_cleansed == "Staten Island") %>%
  count()

ss
## # A tibble: 1 x 1
##       n
##   <int>
## 1     1

As each borough has its own characteristics and would attract visitors coming to NYC for differnt purposes, we would like to know if the length of reviews would be varied according to room-types in a specific borough. As we can see from the bar chart, there is no correlation of the length of reviews and boroughs. However, people tend to write longer reviews for entire home/apartment than for private room and shared room. It is worth pointing out that shared room on Staten Island has lengthy reviews compared to the other combinations. It might be due to a small number of reviews written for shared room on Staten Island, which is confirmed by finding out that there is onyl one review in the combination of these two categorical variables. The choice of color is vision deficiency friendly.

4.4.3 Is the length of review of each room-type or rating category different?

library(grid)
r2 <- gl %>%
  group_by(rating_categ, room_type) %>%
  dplyr::summarise(avg_words = mean(mean_words))

# mosaic plot  
r2$room_type <- as.factor(r2$room_type)
r2$rating_categ <- as.factor(r2$rating_categ)

r2t <- xtabs(avg_words ~ rating_categ+room_type,
                   r2)
cbbPalette <- c("#000000", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

m2 <- vcd::mosaic(rating_categ~room_type, data=r2t,
                  gp = gpar(fill = cbbPalette[5:7]))

# bar chart
p <- ggplot(data=r2, aes(x=rating_categ, 
                    y=avg_words, fill=room_type)) +
geom_bar(stat="identity", position=position_dodge())+
scale_fill_viridis(discrete=TRUE, option="magma")+
  theme_grey(base_size = 16) +
  ggtitle("Review Length in Rating Category") +
  xlab("Rating Category") + 
  ylab("Average Number of Words")

p

ggsave(filename="images/ReviewsRatings.png", plot=p)

The length of reviews is a feature to observe how visitors would describe their good or not-so-good experiences and to see if difference exist among room types. In the mosaic plot, we can see there is an interaction between room type and rating category. While the number of words does not vary much for each rating category for shared room, it changes when people stay in entire home/apartment or private room with good experiences (rating group A or B). This observation reveals an interesing aspect of human behaviours showing that people are willing to put more efforts in terms of review length to talk about their good or wonderful experiences than average ones.

The change of review length is more prominent when we presented it using bar chart stratified by room types within each rating group. There is not difference among room types when people talk about not-so-satisfying experiences as shown in group C. In contrast, people use more words to describe their good and wonderful experiences (as shown in Group A and B) and, similar to the previous observation in section 4.4.2, the length of reviews increases from shared room, private room to entire home/apartment when visitors talk about their experiences in those difference venues. The choice of color is vision deficiency friendly.

4.4.4 Does the length of reviews become shorter or longer over time?

rpg <- rp %>%
  left_join(sc, by = c("listing_id" = "id"))

rpg$date <- as.Date(rpg$date)
rpg$month <- as.Date(cut(rpg$date, breaks = "month"))
rpg$week <- as.Date(cut(rpg$date, breaks = "week"))

monthly changes over years

r3 <- rpg %>%
  filter(date > "2014-01-01" & date <"2018-04-01") %>%
  group_by(month, room_type) %>%
  dplyr::summarise(avg_words = mean(number_words))

p <- ggplot(r3, aes(x=month, y=avg_words)) +
  geom_line(aes(color=room_type), size=1) + 
  scale_color_viridis(discrete=TRUE, option="magma") + 
  ggtitle("Review Length over Years") +
  xlab("Year") + 
  ylab("Average Number of Words")+ 
  theme_grey(base_size = 16) 
p

ggsave(filename="images/ReviewsRoomtype.png", plot=p)

weekly changes in 2017

r3 <- rpg %>%
  filter(date > "2017-01-01" & date <"2018-01-31") %>%
  group_by(week, room_type) %>%
  dplyr::summarise(avg_words = mean(number_words))

p <- ggplot(r3, aes(x=week, y=avg_words)) +
  geom_line(aes(color=room_type), size=1) 

p + scale_color_viridis(discrete=TRUE, option="magma")

One way to follow the change of users’ behaviour on Airbnb over time is to see the amount of efforts they are willing to share their experiences, suggested by the length of review in the past years. The first time series plot clearly demonstrates that the number of reviews decreases over years and is independent of room types. In general, the number of reviews drops at the beginning of a year after the peak in the previous year. Again, the number peaks around Spring and Summer. It decreases for a few months and reaches a peak at the end of a year corresponding to Christmas holidays. There exists difference of reviews length over time among room types, but the overal ddecreasing pattern is similar. The choice of color is vision deficiency friendly.

The second time series plot provides a quick closer look to see if any pattern exists within in a year when people wirte reviews in 2017. We can see that entire home/apartment and private room has similar patterns while shared room has slightly different one. Although some patterns can be observed in this plot, further exploration are required to get more insights. The choice of color is vision deficiency friendly.

V. Text Analytics Summary

In summary, we have interesting findings in our text analytics on how listing owners portray their places (adjective words used in summary and description of amenities) and user behaviours in writting reviews (words to express sentiments on accommodation, efforts spent to write reviews based on room types or experiences in the venue, change of review length over time).